Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pnomolos/faster #2

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

pnomolos
Copy link

@pnomolos pnomolos commented Jun 4, 2017

@joelparkerhenderson After running way too many benchmarks and trying to come up with far too many possible algorithms I believe I have a winner - results here.

I tested the various algorithms (using benchmark-ipsa) against several sources (wikipedia articles and project gutenberg books). Not included in the benchmarks are also some tests against single characters, and against the entire set of characters. I realized we should probably add tests for various encodings as well, I'll try and add those in soonish.

The libunac implementation also blows everything else away (as I expected) but it needs some cleanup and a few more checks.

As well, per my comment (#1 (comment)) should I change the output of those two characters?

Algorithm three was a very interesting one as you'll notice from the results.

  def unaccent_three
    @unaccent_three_regex ||= Regexp.union(ACCENTMAP.keys)
    gsub(@unaccent_three_regex, ACCENTMAP)
  end

very low memory usage (as I believe it would be a single allocation in C) but speed results were all over the place. I'm curious if there's some sort of cutoff in terms of string length where we could alternate between the two, but I haven't had a chance to investigate that far.

@pnomolos
Copy link
Author

pnomolos commented Jun 4, 2017

@joelparkerhenderson Testing in all encodings is turning out to be more of a pain than I expected. What are your thoughts on this? Should unaccent only work on UTF-8 (or UTF-16, I'd imagine) strings, or should it operate on as many encodings as possible?

@pnomolos
Copy link
Author

pnomolos commented Jun 5, 2017

@joelparkerhenderson I've pulled in your latest changes and updated the benchmark.

@redtachyons
Copy link

redtachyons commented Aug 5, 2017

Is it possible to get this PR merged ASAP, addressable gem using this as a dependency and this takes lot of memory. I found this pr when when I was about to raise one with some optimizations which are already done in this PR
Thanks in advance

Ref redtachyons@2da4a4b

@joelparkerhenderson
Copy link
Member

Thanks! I'll go through these today & tomorrow. First glance it all looks fine.

@joelparkerhenderson
Copy link
Member

@pnomolos You're asking great questions. I propose that a reliable way for now is to do just UTF8, and also append that to the gem name i.e. name the gem sixarm_ruby_unaccent_utf8_libunac. I've talked with a couple people who prefer the greater specificity. I'll let you make the call; what would you like?

@redtachyons
Copy link

BTW it is better to gitignore coverage files

@joelparkerhenderson
Copy link
Member

@redtachyons Thanks, I'll ignore /coverage/ and /coverage.data, in an independent commit. I'll also change from listing the dependencies in Gemfile to using the "gemspec" line as above, in an independent commit. And also ignore the Gemfile.lock file.

@pnomolos
Copy link
Author

pnomolos commented Jul 5, 2022

@joelparkerhenderson We're still using this gem internally - if I rework this commit would you consider pulling it in and cutting a new version of the gem?

@joelparkerhenderson
Copy link
Member

Yes. What version of Ruby are you using? What I suggest, if you want, is I can get the gem up to date with Ruby 3. I have time this weekend. Then could you add your improvements on top of that? (Or do you prefer a different way?)

@pnomolos
Copy link
Author

pnomolos commented Jul 5, 2022

Hi Joel,

If you have time to do that it would be great, as we're hoping to move to Ruby 3 sometime later this year.

Thanks!

@pnomolos
Copy link
Author

Hi @joelparkerhenderson if you don't have bandwidth to get to this I'd be fine doing two PRs - one for Ruby 3 support and an updated one for this. Would that be OK with you?

Thanks,

Phil

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants